Abstract 

The recently published discrete mathematical method, extended consensus partition (ECP), identifies 
nucleotide types at each position that are strictly absent from a given sequence set while occurring in 
other sets. These are defined as discriminating elements (DEs). In this study, using the ECP approach, 
we mapped potential hidden identity elements that discriminate the 20 different tRNA identities. We 
filtered the tDNA data set for the obligatory presence of well-established tRNA features, and then, separately 
for each identity set, for the presence of already experimentally identified strictly present identity elements. 
The analysis was performed on the three kingdoms of life. We determined the number of DEs, i.e. the 
number of sets discriminated by the given position, for each tRNA position of each tRNA identity set. 
Then, from the positional DE numbers obtained from the 380 pairwise comparisons of the 20 identity 
sets, we calculated the average excluding value (AEV) for each tRNA position. The AEV provides a 
measure of the overall discriminating power of each position. Using a statistical analysis, we show that 
positional AEVs correlate with the number of already identified identity elements. Positions having high 
AEV but lacking published identity elements predict hitherto undiscovered tRNA identity elements. 
Key words: tRNA; identity element prediction; extended consensus partition 



1. Introduction 

In all organisms, the 20 aminoacyl-tRNA synthetase 
(AARS) enzymes have to recognize their amino acid 
substrates and the corresponding tRNA molecules 
with high precision to produce only legitimate ami- 
noacyl-tRNA products. This exquisite specificity is of 
central importance as this enables the genetic infor- 
mation to be faithfully translated into protein 
sequences by following the rules defined in the 
genetic code. Although principles and many fine 
details of this selective recognition event have 
already been discovered, 1-4 several questions 
still remain unanswered. 5 tRNA positions that 
have utmost roles in the selective interaction with 
the cognate AARS and thus define the identity of the 
tRNA are denoted as identity elements. While only 
laboratory experiments can decisively define the 
identity elements, the large number of potential positions 
and the laborious nature of the experiments 
prompted a great variety of bioinformatics studies to 
predict such elements. These studies require large 
numbers of individual input tRNA sequences to 
locate statistically significant identity-related 
sequence properties. This magnitude of input data 
became available in the form of genomic DNA 
sequences, from which tRNA-detecting algorithms 6,7 
can identify functionally relevant tDNA sequences. 
Such analyses yielded numerous different tDNA data- 
bases. 8-10 Several computational studies reported 
successful functional annotation and in silico identity 
element determination. 11,12 Improved secondary 
structure-predicting algorithm-driven tRNA align- 
ments 13 yielded high-quality input data sets. These 
high-quality sets allowed for innovative sequence 
logo and inverse sequence logo-based analyses of 
tRNA features and identity element predictions. 14 An 
information theory-based approach opened new 
frontiers in visualizing tRNA sequence features and 
predicting determinants and anti-determinants. 15,16 
Computational methods for analysing tRNA identity 
were compared in a recent review by Ardell. 17 

In this paper, we introduce a new approach based 
on the recently published 'extended consensus parti- 
tion' (ECP) algorithm. 18 The ECP algorithm provides 
a discrete mathematical measure of pairwise dis- 
tances of functionally related aligned sequence sets. 
It was first introduced to reveal characteristic 
sequence features that discriminate the two tRNA 
sets corresponding to Class I and Class II AARS 
enzymes. 

In this study, we applied the ECP algorithm to assess 
the potential of each tRNA position to discriminate 
the 20 different tRNA identity sets from each other. 
The ECP method heavily relies on characteristic 
positional absence of nucleotide base types. Because 
of that, the method is sensitive even to the rare occurrence 
of atypical sequences. To remove such 
sequences, we filtered the tDNA data sets for the 
obligatory presence of well-established tRNA features. 
Moreover, like all bioinformatic studies, the ECP analysis 
requires a large number of input sequences. In 
this case, large sets are needed to reliably identify nucleotide 
types that are strictly absent from a given position of 
the aligned sequence set, i.e. whose absence is not 
due to stochastic sampling error. 

In order to provide the necessary large input sets, 
we performed the ECP analyses on the three king- 
doms of life instead of individual species. 
Nevertheless, we aimed to compare tRNA identity 
sets that contain isofunctional sequences despite 
originating from different species. Therefore, 
separately for each identity set, we further filtered 
the sets for the presence of experimentally verified 
strictly present identity elements. As a control experi- 
ment, we also performed the analysis by omitting this 
second filtration step. 

We argued that tRNA molecules sharing a large set 
of experimentally verified identity elements should 
interact with their corresponding AARS enzyme simi- 
larly and therefore should also share yet unidentified 
common identity elements. 

By combining the ECP method with simple statis- 
tics, we generated average excluding values (AEVs) 
providing a measure on the overall discriminating 
power of each tRNA position. We show that both 
with and without the second filtering step, positional 
AEVs correlate with the number of already identified 
identity elements. We argue that positions having 
high AEV but lacking already published identity ele- 
ments predict hitherto undiscovered tRNA identity 
elements. 



The analysis located such potential identity ele- 
ments on the anticodon arm (30:40 and 31:39) 
and suggests that the core region also contributes to 
defining tRNA identity. 

2. Materials and methods 

2.1. Data set building 

The tDNA sequences of Bacteria and Eukaryotes 
were downloaded from the tRNAdb database. 9 The 
Archaea set of this database does not yet contain 
the recently discovered and characterized 19-22 split 
tRNAs, which have been organized in the SPLITSdb 
database. 20 Split tRNA data are already included in 
the tRNADB-CE database, 10 which, however (unlike 
Sprinzl and tRNAdb), does not contain aligned 
sequences. Therefore, we downloaded both normal 
and split Archaea tDNA from the tRNADB-CE database 
and aligned the sequences with the ClustalW software 
and manually, as described by Fujishima et al. 23,24 In 
the case of the Archaea data set, we omitted the variable 
loop sequences from the analysis because of alignment 
complications. 

The downloaded set was filtered for sequences that 
fulfil several criteria. 

2.2. First filtering step for all data sets 

In the first ECP-based study, 18 we used the database 
from the tRNomics study of Marck and Grosjean. 25 
Although for the present study we used a larger, 
updated database, in the case of Bacteria and 
Eukarya we could still use, as filtering rules, the 
well-established kingdom-specific strictly present elements 
defined by that tRNomics study. For 
Archaea, we used the filtering rules of Fujishima 
et al. 23 The obligatory presence of these elements 
established our first filtering criteria. The element 
sets for the three kingdoms were as follows. Bacteria: 
H14, G18, R19, Y33, G53:C61, T54, T55, Y56, D57, 
A58. Archaea: Y8, A14, G15, G18, G19, R21, T33, 
Y48, G53, T54, T55, C56, R57, A58. Eukarya: Y8, 
Y11, A14, -17a, G18, G19, R21, R24, H32, Y33, 
R37, H38, G53, H54, T55, C56, R57, A58, C61. 
(Note that nucleotides and their sets are denoted by 
IUPAC nucleotide codes.) Discarding sequences that 
lacked any of the strictly present kingdom-specific elements 
removed incorrectly sequenced or most likely 
non-functional tDNA data. 
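
The first filtering step can be illustrated with a minimal Python sketch. It uses the bacterial element set listed above; the representation of a sequence as a dictionary from position label to base, the function names and the gap symbol are assumptions made for illustration only and are not taken from the original pipeline.

```python
# IUPAC nucleotide codes mapped to the bases they stand for (T is used, as in tDNA).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "H": "ACT", "D": "AGT",
}

# Bacterial rule set quoted in the text; the G53:C61 pair is split into
# two positional rules (an illustrative simplification).
BACTERIA_RULES = {
    "14": "H", "18": "G", "19": "R", "33": "Y", "53": "G", "61": "C",
    "54": "T", "55": "T", "56": "Y", "57": "D", "58": "A",
}

GAP = "-"  # hypothetical symbol for an unoccupied alignment position


def passes_first_filter(seq, rules):
    """seq maps a position label (e.g. '14') to the base found there."""
    return all(seq.get(pos, GAP) in IUPAC[code] for pos, code in rules.items())


def first_filter(seqs, rules):
    """Keep only the sequences that carry every strictly present element."""
    return [s for s in seqs if passes_first_filter(s, rules)]


good = {"14": "A", "18": "G", "19": "A", "33": "T", "53": "G", "61": "C",
        "54": "T", "55": "T", "56": "C", "57": "G", "58": "A"}
bad = dict(good, **{"18": "A"})  # violates G18
kept = first_filter([good, bad], BACTERIA_RULES)
```

In this toy case only the first sequence is kept, because the second one lacks G18.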

2.3. Second filtering step for the bacterial 
and eukaryotic data sets 

We grouped the sequences by identity and 
filtered for the presence of already identified and published 
major, strictly present identity elements characteristic 
of the given amino acid identity set. 3 
Figure 1. Illustration of calculating the AEV using short artificial sequences. (A) Identifying DEs using an artificial set of short sequences 
belonging to three amino acid identity sets. Identities are labelled as 1aa, 2aa and 3aa and are highlighted in cyan, yellow and 
magenta, respectively. Each identity set contains four short tetramer sequences. The calculated DE is either one base or a 
combination of bases and is labelled by the IUPAC codes of bases or base sets. Each step of the DE-generating algorithm is explained 
in Section 2. (B) The calculated DE for each identity pair. Note that the DE-generating relationship is non-symmetrical, i.e. the 1aa 
vs. 2aa pair has a DE different from that of the 2aa vs. 1aa pair. (C) Positional summation of DE. The positional sum of the DE values 
(shown in the lowest row) provides the number of pairwise discriminations provided by the given position. The sums of the DE 
values are the input data for calculating the AEVs. The AEV is generated by dividing the positional sum of DE by the number of 
identities (which in a real case is 20, while in this didactic case it is 3). Formalism and a more detailed description are provided in 
Section 2. This figure can be viewed in colour online. 



These sets for Escherichia coli and Saccharomyces 
cerevisiae are listed in Supplementary Table S1. By 
excluding sequences that lacked these elements, we 
aimed to generate tRNA identity sets that contain iso- 
functional sequences expected to function in E. coli or 
yeast, respectively. We argued that if hidden identity 
elements still exist, those could also be shared by 
these filtered sequences. 

Out of the published identity elements, only the 
determinants were included, while the anti-determinants 
were not considered. One determinant, the 
G15:G48 Levitt pair of tRNA Cys, was omitted from 
the filtering as this is idiosyncratic to E. coli (more 
accurately, it could have emerged in the common 
ancestor of E. coli and Haemophilus influenzae). 26 



2.4. Third filtering step for all data sets 

Finally, in order not to bias the statistical analysis, 
we removed any redundancy from the data set by 
keeping only unique sequences. 

Supplementary Table S2 organized by the three 
kingdoms contains the species list corresponding to 
our raw data set and indicates the number of 
sequences contributed by each species in the raw 
set and after each filtering step. 

Supplementary Data S1 shows the resulting sets 
after the final filtering step. It contains six databases 
in multi-FASTA format, two for each kingdom. For 
each kingdom, one database contains the set of non-redundant 
sequences, while the other contains 
all sequences minus the non-redundant set, i.e. all 
the 'siblings'. 
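
The split into a non-redundant set and its 'siblings' can be sketched in a few lines of Python. This is a minimal sketch under the assumption that the first occurrence of each sequence is kept as the unique representative; the text does not state which copy was retained, and the function name and record names are hypothetical.

```python
def split_unique_and_siblings(records):
    """records: list of (name, sequence) pairs.

    Returns (unique, siblings): the first record carrying each distinct
    sequence, and all later records that repeat an already seen sequence.
    """
    seen = set()
    unique, siblings = [], []
    for name, seq in records:
        if seq in seen:
            siblings.append((name, seq))
        else:
            seen.add(seq)
            unique.append((name, seq))
    return unique, siblings


records = [("sp1_Ala", "GGGGCTA"), ("sp2_Ala", "GGGGCTA"), ("sp3_Ala", "GGGGCTT")]
unique, siblings = split_unique_and_siblings(records)
```

Together the two lists always contain every input record exactly once, which mirrors the two databases per kingdom described above.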



2.5. Determination of discriminating elements by the extended consensus partition algorithm 

Filtered data sets for each kingdom were analysed 
by the already published extended consensus parti- 
tion (ECP) algorithm. 18 Principles of the analysis and 
the algorithm remained the same. However, in this 
case, we compared all pairs of the 20 tRNA identity sets 
rather than the two AARS-based tRNA classes. The 
logic of the algorithm is illustrated in Fig. 1. Because 
the pairwise ECP analysis is non-symmetrical, from 
the filtered data set and separately for each 
kingdom, we produced all 380 (20 × 19) ordered identity 
pairs. For each pair, we identified the discriminating 
elements (DEs) through the ECP algorithm as follows. 

In each identity set and for each kingdom, we 
scanned the positions of the filtered and aligned 9 
data set. At each position, we documented the strictly 
absent elements, i.e. bases that at the given position 
are missing from each sequence of the given identity 
set. For each detected strictly absent element, we 
checked, in a pairwise manner, whether any sequence 
of each other identity set contains that element. 
If yes, the detected 
strictly absent element is a DE that discriminates the 
two identity sets. For each kingdom and each position, 
these pairwise-interpreted DEs were 
documented. 
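
The procedure described above can be sketched in Python as follows. This is an illustrative sketch, not the published ECP implementation: sequences are assumed to be aligned strings of equal length, any character outside A, C, G, T (such as a gap symbol) is ignored, and the function names and toy tetramers are hypothetical.

```python
BASES = frozenset("ACGT")


def bases_at(seqs, j):
    """Set of bases observed at alignment column j across one identity set."""
    return {s[j] for s in seqs} & BASES


def discriminating_elements(set_i, set_l, j):
    """Bases strictly absent from set_i at column j but present in set_l."""
    strictly_absent = BASES - bases_at(set_i, j)
    return strictly_absent & bases_at(set_l, j)


# Toy identity sets of tetramers, in the spirit of Fig. 1.
set_1 = ["ACGT", "ACGA", "ACGC", "ACGT"]
set_2 = ["GCGT", "ACTT", "GCGT", "ACTT"]

de_1_vs_2 = discriminating_elements(set_1, set_2, 0)  # {'G'}
de_2_vs_1 = discriminating_elements(set_2, set_1, 0)  # set()
```

The two calls illustrate the non-symmetrical nature of the relationship: G is strictly absent from the first column of set 1 and occurs in set 2, whereas nothing that is absent from the first column of set 2 occurs in set 1.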

A mathematical description of the above procedure is 
given below. 

We introduce the variable $Y$. Elements of $Y$ are 
nucleotide bases; therefore, $Y \in X$, 
where $X = \{A, C, G, T\}$. The value of $Y_{ijk}$ is the nucleotide 
base corresponding to position $j$ ($j = 1, \ldots, L$; $L = 96$; 
from normal position 0 to position 73, including the 
extra positions from e1 to e22) of the sequence 
$k$ ($k = 1, \ldots, M_i$, where the value of $M_i$ varies for individual 
species) of amino acid identity $i$ ($i = 1, \ldots, N$; 
$N = 20$). 

Then, we introduce the set of bases existing at the 
position $j$ of identity $i$: 

$$Y_{ij} := \{ Y_{ijk} \mid k = 1, \ldots, M_i \}$$

The DE of identity set $i$ against identity set $l$ (where 
$l = 1, \ldots, N$; $N = 20$) at position $j$ is defined as follows: 

$$d_{ij}^{l} := (X \setminus Y_{ij}) \cap Y_{lj}$$



2.6. The AEV 

We introduced the AEV to determine the weighted 
average frequency of DE at each position as follows. 
At each identity set and at each position, we deter- 
mined how many identity set pairs are discriminated 
by the given position. These numbers from each set 
were summed up and were divided by 20 (the 
number of identities), resulting in the AEVs that 
demonstrate the discriminating potential of each 
tRNA position. 

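In mathematical terms, the AEV is defined as 
follows: 

$$n_j := \frac{1}{N} \sum_{i=1}^{N} \sum_{\substack{l=1 \\ l \neq i}}^{N} n_{ij}^{l}$$

where $n_{ij}^{l}$ is 1 if the DE of identity set $i$ against identity set $l$ at position $j$ is non-empty and 0 otherwise. 

The $n_j$ value is denoted as the AEV. 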
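
The AEV calculation can be sketched in Python as follows. This is a minimal sketch under stated assumptions: each identity set is a list of aligned strings of equal length, characters other than A, C, G, T are ignored, and every ordered identity pair discriminated at a position counts once regardless of how many bases discriminate it. The function name and the toy data are hypothetical.

```python
BASES = frozenset("ACGT")


def aev_per_position(identity_sets, length):
    """identity_sets: dict mapping identity name -> list of aligned sequences.

    Returns a list with one AEV per alignment column.
    """
    names = list(identity_sets)
    # Bases present in each identity set at each column.
    present = {
        n: [{s[j] for s in identity_sets[n]} & BASES for j in range(length)]
        for n in names
    }
    aev = []
    for j in range(length):
        count = 0
        for i in names:
            absent = BASES - present[i][j]
            for l in names:
                if l != i and absent & present[l][j]:
                    count += 1  # ordered pair (i, l) is discriminated at column j
        aev.append(count / len(names))
    return aev


toy = {
    "1aa": ["AC", "AC"],
    "2aa": ["GC", "GC"],
    "3aa": ["AC", "GC"],
}
toy_aev = aev_per_position(toy, 2)
```

In the toy example, four of the six ordered pairs are discriminated at the first column, giving 4/3, while the fully conserved second column gives 0. With 20 identities the value of a position can range from 0 to 19.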

2.7. Statistical analysis 

We performed a simple statistical analysis to assess 
how the positional AEVs relate to the number of hith- 
erto identified identity determinants. We assume that 
unidentified determinants still exist; therefore, the 
determinant set is only a subset of all existing ele- 
ments. The test was done only on the bacterial set 
because only that sequence set contained enough 
input sequences. We correlated the AEVs with the 
number of published determinants (NPDs) with 
both the Pearson and Spearman analyses. 
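
For readers who wish to reproduce this type of comparison, the two correlation coefficients can be computed as in the following Python sketch. It is a generic illustration using standard textbook definitions, not the software used in this study; the function names and the toy AEV and NPD vectors are hypothetical. Ties are given average ranks, which matters here because many positions share an NPD of zero.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


toy_aev = [2.0, 5.5, 7.0, 12.0, 13.5]   # hypothetical positional AEVs
toy_npd = [0, 0, 1, 6, 9]               # hypothetical positional NPDs
r = pearson(toy_aev, toy_npd)
rho = spearman(toy_aev, toy_npd)
```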

As both types of correlation were relatively weak, 
we also performed a bootstrap analysis as explained 
in Supplementary Data S2. In addition, the positional 
NPD and AEVs were compared with their respective 
weighted average values and ranked from very low 
to very high values based on standard deviation. 
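
The ranking by standard deviation can be sketched in Python as follows. The five bands use the cut-offs given in the legend of Fig. 2 (±0.5 SD and ±1.5 SD around the weighted average); the category labels, the treatment of values lying exactly on a boundary and the function name are assumptions made for this illustration.

```python
def sd_band(value, mean, sd):
    """Classify a positional value by its deviation from the average."""
    z = (value - mean) / sd
    if z < -1.5:
        return "very low"    # blue in Fig. 2
    if z < -0.5:
        return "low"         # cyan
    if z <= 0.5:
        return "average"     # green
    if z <= 1.5:
        return "high"        # yellow
    return "very high"       # red


# Bacterial AEV average and SD as quoted in the legend of Fig. 2.
MEAN, SD = 6.45, 3.54
bands = [sd_band(v, MEAN, SD) for v in (1.0, 3.0, 6.45, 9.0, 12.0)]
```

With the bacterial values, an AEV of 12.0 lies more than 1.5 SD above the average and therefore falls into the highest band.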



3. Results and discussion 

The ECP analysis was performed both on the final 
filtered data sets and omitting the second filtration 
step (see later). At each position, the AEV was calcu- 
lated and compared with the positional NPD value. 
Positional AEV/NPD values for bacterial and eukary- 
otic data and AEVs for Archaea are listed in 
Supplementary Table S3. In Fig. 2 at each position, 
AEV and NPD values were plotted and colour coded 
in the same diagram as explained in the figure legend. 

The positional AEV/NPD patterns show character- 
istic similarities. In the following paragraphs, we orga- 
nized the results from the highest to the lowest AEVs. 

3.1. Bacterial (coli-like) data set 

The density function of the AEVs in the bacterial set 
shows a normal distribution, as the number of positions 
having values over the weighted average plus 0.5 SD 
(31 positions) practically equals the number of those below 
the weighted average minus 0.5 SD (32 positions). 
In order to facilitate the visual comparison of 
the NPD and the AEVs at each position, these are illustrated 
in a composite plot as shown in Fig. 2A and B. 

Importantly, two anticodon positions, 35 and 36, 
have the highest AEVs (located in the red zone) and 
these have the highest NPD values as well. On the 
other hand, position 34 (located in the yellow 
zone), which pairs with the third, wobbling codon 
position, has significantly (over one sigma) lower 
discriminating potential. Notably, out of the three 
anticodon positions, this has the lowest NPD as 
well. Most positions with high AEVs (located in the 
yellow zone) host known identity elements, 
and it is also clear that the AEVs of positions that 
base pair with each other correlate. In the acceptor 
arm, three position pairs contain determinants for 
many identities. These are 1:72 (Trp, Gly, Thr, Gln), 
2:71 (Met, Trp, Asp, Gly, Ser, Cys, Ala, Gln) and 3:70 
(Val, Met, Trp, Gly, Ser, Cys, Ala, Gln), which as a set 
carry known determinants for half of the identities. 3 
The discriminator base in position 73 has the third 
highest AEV, right behind positions 35 and 36. 

In the tRNA core region, 27 the 12:23 pair having 
high AEVs overlaps with the known tRNA Ile identity 
element T12:A23. Furthermore, high AEV positions 
13, 22 and 46 have a published role as the tRNA Glu identity 
element T13:G22:A46, 28-30 and deletion of 
position 47 with high AEV was also identified as a 
tRNA Glu determinant. The 13:22 pair has been identified 
as a determinant for tRNA Cys as well. 31,32 

Other positions having higher than average AEVs host 
several identity elements as follows. Position 38 in the anticodon 
loop contains determinants for tRNA Ile, tRNA Asp and 
tRNA Gln, 33,34 while positions 10 (tRNA Asp, tRNA Gln); 11:24 
Figure 2. Results for the Bacterial (A and B), Eukaryote (C and D) and Archaea (E and F) data sets. ECP diagrams (A, C, E). In each ECP 
diagram, each column belongs to a position. The upper column set refers to positional AEVs. The colour codes reflect the statistical analysis. 
For each kingdom, the weighted average of AEV calculated for all positions from 0 to 73 as a single set and the corresponding 
standard deviation are as follows: Bacteria 6.45 ± 3.54; Eukaryote 6.31 ± 3.49 and Archaea 6.72 ± 3.69. Segments of columns are 
colour coded based on their deviation from the weighted average as follows: below -1.5 SD blue; between -1.5 and -0.5 SD cyan; 
between -0.5 SD and +0.5 SD green; between +0.5 and +1.5 SD yellow; above +1.5 SD red. White illustrates positions where the 
analysis is not applicable. These are the 3' CCA end for each kingdom, and the unpopulated e1 position in the Archaea set. The 
lower column set illustrates the positional NPDs. The logic of the colour-coding scheme is the same as for the AEVs. For each 
kingdom, the weighted average of NPD calculated for all positions from 0 to 73 as a single set and the corresponding standard 
deviation are as follows: Bacteria 1.63 ± 3.63; Eukaryote 0.49 ± 1.7. Archaea have too few experimentally verified determinants 
(listed in Table 2) to perform this analysis. ECP cloverleaf (B, D, F). The cloverleaf structure illustrates spatial relationships of many 
base-pairing residues. Each position is illustrated as a circle. The upper half of the circle corresponds to the AEV and its colour coding 
is the same as for the corresponding ECP diagram. The lower half of the circle corresponds to the number of published identity 
elements. The colour coding is similar to that in the corresponding ECP diagrams except for positions where the NPD is zero. These 
positions are indicated in grey. White illustrates positions where the analysis is not applicable. These are the 3' CCA end for each 
kingdom, and the single unpopulated position e1 in the Archaea set. This figure can be viewed in colour online. 



(tRNA Ser, tRNA Glu); 15:48 (tRNA Cys, tRNA Pro); 20 (tRNA Phe, 
tRNA Arg, tRNA Ala); 29:41 (tRNA Ile) and 45 (tRNA Phe) also 
have a few indicated determinants. 3,35 

At the more conserved T-loop position 60 having 
average AEV, there is another tRNA Phe determinant. 3,35 
The variable loop position having the 
highest, but otherwise average, AEV is e2, which is a 
known identity element of tRNA Ser. 36 

The position pair 31:39 at the anticodon loop has 
no known identity element associated with it. In 2338 out 
of the 2406 analysed non-redundant bacterial 
sequences, these positions form a normal Watson-Crick base 
pair, but based on the high AEVs, the exact identity 
of the given base pair might serve an identity 
element function. 

Based on the high AEV but low NPD values, position 
pairs 12:23 and 13:22 as well as position 46 might 
contain hitherto unidentified determinants. 

The facultative elements (17, 17a, 20a and 20b) in 
the D-loop and position 47 might also have identity 
functions (detailed in Section 3.5). 

Several elements with the lowest AEVs, indicated in 
blue in Fig. 2A and C, do not show any discrimination 
potential. These are functionally highly conserved 
elements shared by all tRNA sequences, and most of 
these were used in the first filtering step. Positions 
with still lower than average AEVs highlighted with 
cyan or those with average AEV highlighted as green 
typically coincide with regions that do not contain 
known identity elements. For example, at the 6:67 
pair, the NPD is zero. Nevertheless, there are some 
exceptions. For the 5:68 position pair, a single identity 
element has been published, the A5:T68 for 
tRNA Met. 37 Moreover, the conserved A37 in the anticodon 
loop has been identified as an Ile, Met, Glu and 
Gln determinant, 27,37-40 although over half of the 
tRNA sequences harbour adenine at this position. 
Finally, in spite of being located at low AEV regions, 
the U8:A14 for tRNA Leu 41 and the G27:C43, 
G28:C42 and T59 for tRNA Phe have also been shown 
to be determinants. 42 



3.2. Eukaryotic (yeast-like) data set 

While the bacterial data set represented coli-like 
identity rules, the eukaryotic set contained sequences 
that conformed to the already established yeast iden- 
tity rules. The results are illustrated in Fig. 2 in the 
same way as for the bacterial set. 

Just like in the case of the bacterial set, the majority 
of known determinants are located at positions 
having the highest AEVs, namely the three anticodon 
positions and the discriminator position 73. The 
three base pairs of the acceptor arm have the next 
highest NPD values and these also represent higher 
than average AEVs. Another high AEV position where 
a single known identity element for tRNA Phe exists is 
position 20. 43 

There are several positions where higher than 
average AEVs exist, but no determinants have been 
identified. At some of these positions, such as 12:23, 
13:22 and 45, 46 and 47, which participate in establishing 
the core region, there are already published determinants 
for the bacterial set. 

Interestingly, the high AEV positions 31:39 and 
30:40 in the anticodon stem contain no identified 
yeast determinants, but harbour published determinants 
for human tRNA Phe. 44 

The lower than average AEVs indicated with blue 
and cyan colours in Fig. 2C and D are again all asso- 
ciated with conserved positions lacking any known 
identity elements. Very few published identity ele- 
ments coincide with positions having average AEVs 
indicated with green in Fig. 2C and D. Position 3 harbours 
determinants for tRNA Gly and tRNA Ala, the anticodon 
loop position 37 for tRNA Leu and positions 38 
and 10:25 for tRNA Asp. 45-49 Position 70, which pairs 
with position 3, has a higher than average AEV. 

3.3. Testing potential biases introduced by the second 
filtering step 

We have checked how the second filtering step 
alters the input sequence set from various species. 
For bacteria, the effect of filtering closely mirrored 
evolutionary relations. Data in Supplementary Table 
S2 show that after the second filtering step, species 
containing the highest number of retained sequences 
are the closest relatives of E. coli. Namely, from the 
Gammaproteobacteria class (genera Escherichia, 
Haemophilus, Salmonella, Yersinia, Buchnera, Shigella), 
75-90% of the input sequences are retained by the 
second filtering step. Moving away from E. coli on 
the phylogenetic tree, the proportion of retained 
sequences gradually decreases. A still high (around 
70%) proportion of sequences is retained in the 
case of other Proteobacteria (genera Desulfovibrio, Brucella, 
Campylobacter), but, for example, in the case 
of Firmicutes (genera Streptococcus, Bacillus, 
Lactobacillus, Lactococcus, Staphylococcus) only 50-70%, 
while in the case of Tenericutes (Mycoplasma, 
Ureaplasma) or Actinobacteria (Mycobacterium, 
Streptomyces), only 30-50% of the input sequences 
are retained. 

In the case of Eukarya, no such trend was observed. 
This might be due to the much lower number, 
compared with bacteria, and possibly more general 
nature of determinants published for yeast. 

The positional AEVs measure the average distance of 
functionally defined sets. One might think that the 
second filtering for the presence of identity elements 
could increase the separation of the identity sets and 
thus increase the AEVs at identity element positions. 
In order to assess this potential effect, we repeated 
the analysis for the bacterial and the eukaryotic sets 
by omitting the second filtration. The data are orga- 
nized in Supplementary Table S3, Table 1 and 
Supplementary Figure S1. 
Table 1. Representative data for tDNA sequence set processing and analysis with and without filtering for the presence of known 
identity elements 

                              Bacteria                                Eukaryota                               Archaea
                              Without second      With second         Without second      With second         No second
                              filtration (a)      filtration          filtration          filtration          filtration
No. of sequences
  Raw data                    6243                6243                2222                2222                1552
  First filtration            6144                6144                1930                1930                1384
  Second filtration           -                   3901                -                   1672                -
  Non-redundant data set      3946                2406                1495                1264                1041
AEV
  Average                     6.45                5.59                5.79                6.31                7.34
  SD                          3.54                3.51                3.43                3.49                3.97
  Pearson (b) (R)             0.53                0.55                -                   -                   -
  Spearman (b) (ρ)            0.39                0.54                -                   -                   -
Bootstrap (c)
  Mean                        224                 258                 -                   -                   -
  SD                          16.9                17.1                -                   -                   -
  Cumulative AEV threshold    358.95              344.55              -                   -                   -
  Significance (P)            1.33e-[illegible]   3.54e-7             -                   -                   -

(a) The optional second filtration was performed based on the presence of experimentally verified E. coli (for Bacteria) or yeast 
(for Eukaryota) identity elements. Archaea lack enough verified identity elements; therefore, the second filtration was not 
performed. 

(b) Correlation of the NPDs and the AEVs was done as described in Section 2. 

(c) The bootstrap analysis is described in Supplementary Data S2. 


Briefly, in the case of the bacterial data set, where 
40 positions carry known identity elements, second 
filtration removed 39% of the input sequences, i.e. 
those that lacked at least one required E. coli identity 
element. As a result, at positions where known identity 
elements exist (positive NPD), the sum of the AEV 
increased by 22%, while at positions with no 
reported identity elements (zero NPD), the increase 
was only 8%. Out of the 40 positions, this AEV increase 
exceeded the standard deviation at 10 positions. 

In the case of the eukaryotic data set, where only 15 
positions carry known identity elements, second filtration 
caused a much smaller effect. It removed only 15% 
of the input sequences, i.e. those that lacked at least 
one required yeast identity element. As a result, at positions 
where known identity elements exist (positive 
NPD), the sum of the AEV decreased by 5%, while at 
positions with no reported identity elements (zero 
NPD), the decrease was 9%. However, this decrease 
was statistically significant only at position 23, where 
no identity element has been reported. 

For the bacterial set, we performed several statistical 
analyses on both the second-filtered and 
non-filtered data sets (Table 1). The Pearson correlation 
gave an R-value of 0.55 for the filtered and 0.53 
for the non-filtered case, while the Spearman correlation 
yielded ρ-values of 0.54 and 0.39, respectively. 



We also applied a bootstrap analysis to test the stat- 
istical significance of high AEV positions overlapping 
with positions harbouring known identity elements 
(Table 1, Supplementary Data S2). Importantly, the 
overlap was statistically significant both in the non-fil- 
tered and in the filtered case, demonstrating that 
omitting the second filtration step did not change 
the overall distribution of the AEVs. The highest AEVs 
are at the anticodon positions, the discriminator 
base and the acceptor arm (1:72; 2:71). These are 
well-known positions of identity elements. 
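
The idea behind such a resampling test can be sketched in Python as follows. The actual procedure is described in Supplementary Data S2 and is not reproduced here, so this is only a generic sketch under stated assumptions: random position sets of the same size as the set of positions with known identity elements are drawn without replacement, and the fraction of draws whose summed AEV reaches the observed sum is taken as an empirical P-value. The function name, the number of draws and the toy data are hypothetical.

```python
import random


def resampling_p_value(aev, known_positions, draws=10000, seed=0):
    """aev: dict mapping position label -> AEV.

    Returns (observed cumulative AEV at known positions, empirical P-value).
    """
    rng = random.Random(seed)
    positions = list(aev)
    k = len(known_positions)
    observed = sum(aev[p] for p in known_positions)
    hits = 0
    for _ in range(draws):
        sample = rng.sample(positions, k)  # random set of k distinct positions
        if sum(aev[p] for p in sample) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible P = 0.
    return observed, (hits + 1) / (draws + 1)


# Toy data: 20 positions, three of which combine a high AEV with known elements.
toy_aev = {pos: 1.0 for pos in range(20)}
for pos in (0, 1, 2):
    toy_aev[pos] = 10.0
observed, p_value = resampling_p_value(toy_aev, [0, 1, 2])
```

In the toy case only one of the 1140 possible three-position sets reaches the observed sum, so the empirical P-value is small.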

This suggests that this filtration does not introduce 
artefacts. On the other hand, as we emphasized, only 
the second filtration yields a data set, in which 
sequences belonging to the same identity set share 
known identity elements, suggesting that they might 
also share yet unidentified common identity elements. 

3.4. Archaea data set 

The second filtration step was not performed for 
the Archaea data set as only sparse experimental 
data are available for such elements (listed in 
Table 2). The only comprehensive analysis available 
for each identity set is an in silico study based on 
sequence alignments. 50 

Positions of typical determinants, such as the 
discriminator base or members of the anticodon, 
Table 2. Experimentally determined and published Archaea identity elements 

Identity  Element                                                            Species                                                          Refs.
Ala       G3:U70                                                             Archaeoglobus fulgidus; Pyrococcus horikoshii                    71,72
Asp       [illegible]                                                        Pyrococcus kodakaraensis                                         [illegible]
Gly       C35, C36, C2:G71, G3:C70                                           Aeropyrum pernix K1                                              76
His       [illegible]                                                        Aeropyrum pernix K1                                              77
Phe       G34, A35, A36, A73, G20                                            Aeropyrum pernix K1                                              55
Pro       G35, G36, A73, G1:C72                                              Aeropyrum pernix K1                                              78
Ser       G30:C40, G73, variable loop; G1:C72, C3:G70, variable loop         Methanosarcina barkeri (Archaeal RS); Methanococcus maripaludis  56,79
Thr       U73, C2:G71                                                        Haloferax volcanii; Aeropyrum pernix K1                          54,80,81
Trp       C34, C35, A36, A73, G1:C72, G2:C71                                 Aeropyrum pernix K1                                              82
Tyr       C1:G72, A73                                                        Aeropyrum pernix K1                                              57,58



have significantly higher than average AEVs (yellow 
and red zones in Fig. 2A and C). For some Archaea 
species and some identity sets, these have been 
experimentally verified as determinants (see in 
Table 2). However, it is noteworthy how differently 
the individual kingdoms (or sometimes groups 
within a kingdom) use identity elements on the 
acceptor stem. This can be illustrated through the 
example of tRNA Thr . 

In the case of E. coli, 51 the discriminator base is not 
a tRNA Thr determinant, while the first three base pairs 
(1:72; 2:71; 3:70) are identity elements. The most 
important one is the 2:71 pair. For yeast, 52 the 
discriminator base and the first and third acceptor 
stem base pairs are determinants, and similar results 
were published for the Thermus thermophilus 
bacterium. 53 Studies on the tRNA Thr identity elements 
for two Archaea species revealed that their discriminator 
base and acceptor stem base pairs are used 
differently. In the case of Haloferax volcanii, these 
were used similarly as in yeast and T. thermophilus, 
while in Aeropyrum pernix, these elements were used 
as in E. coli. 54 Based on the AEVs at this area in the 
Archaea set, the majority of the Archaea might 
have an identity element distribution like that in 
A. pernix. Moreover, the role of the discriminator 
base relative to the anticodon set appears to be dampened 
in this kingdom, suggesting that it might have 
roles in fewer identity sets than in the other two 
kingdoms. 

The 3:70 base pair positions have outstanding AEVs,
and this pair was shown to be a determinant for
tRNA Ala, tRNA Gly and tRNA Ser in a few Archaea
species (Table 2). Position 20 also has a higher than
average AEV (yellow zone), and this position carries
an identity element for tRNA Phe in Archaea. 55

The position pairs 29:41 and 31:39 in the anticodon
stem also have higher than average AEVs, yet
no determinants have been identified at these
positions. On the other hand, in
Methanosarcina barkeri, a Ser identity element has been
found at the 30:40 pair, which has only an
average or low AEV. 56 As in the bacterial set,
there are several positions in the core region and at
the facultative base positions where the AEVs are
higher than average, yet no identity element has
been published.

Similar to the bacterial set, the AEV of position 34 is 
lower than those of positions 35 and 36. However, in 
this case, position 34 has a much lower, only average 
AEV (green zone), while in bacteria, it belonged to the 
highest category (red zone). 

Positions having the lowest AEVs, just as in the
bacterial data set, are located at conserved
positions that define the tRNA architecture. A noteworthy
difference compared with the bacterial data is that
in Archaea, the 1:72 position pair has only average
AEVs (Fig. 2A and C). This appears to be due to the
fact that in ~90% of the Archaea tRNA sequences,
there is a G1:C72 pair at this position. Based on the
already published sequence analysis, 50 the few
exceptions are found in the initiator tRNA Met and
in tRNA Gln and tRNA Tyr. The C1:G72 pair of tRNA Tyr
has been experimentally verified to be a genuine identity
element. 57,58

3.5. Potential hidden identity elements 

Positions having higher than average AEVs but containing
no published determinants that would have
been used in our second filtering step might
harbour hitherto unidentified determinants. The
most likely hidden identity elements are illustrated
in Fig. 3. We searched the literature for any
determinants identified at such positions in species
other than coli-like bacteria or yeast. Such cases
were already mentioned above for the 30:40 and
31:39 base pairs, which are identified human
determinants. 44 Therefore, we analysed these two base pairs
by checking our data set for each identity set. In the
simplest scenario, we searched for base pairs that, in
any one of the 20 identity sets, differ from those in all other
19 sets. In the bacterial set, there was a single such
case, a normal Watson-Crick base pair, while for
the eukaryotic set, all such pairs were non-Watson-
Crick base pairs.

Figure 3. Hitherto unidentified potential identity elements.
Positions having high AEV but low NPDs could harbour
hitherto unidentified identity elements. Grey or black
highlights the most likely positions of such elements. For
positions highlighted in black, we also propose the
corresponding identity: bold corresponds to yeast and
italic to E. coli. Middle grey highlights 'core region' positions
having high AEV and low NPD in all three kingdoms. Light grey
highlights the highest AEV base pair in the Archaea kingdom.
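
The simplest-scenario search described in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure actually used in the study: the function name `unique_base_pairs`, the dictionary layout of the identity sets and the 0-based column indices are all assumptions, and the toy sequences are invented. The sketch reports a base pair for an identity set if it occurs in at least one sequence of that set and in no sequence of any other set.

```python
def unique_base_pairs(identity_sets, pos_a, pos_b):
    """Return {identity: base pairs seen in that set and in no other set}.

    identity_sets: dict mapping an identity name to a list of aligned
                   sequences of equal length (hypothetical data layout).
    pos_a, pos_b:  0-based column indices of the two paired positions.
    """
    # Collect every base pair observed at the two positions, per identity set.
    pairs_by_identity = {
        name: {(seq[pos_a], seq[pos_b]) for seq in seqs}
        for name, seqs in identity_sets.items()
    }
    unique = {}
    for name, pairs in pairs_by_identity.items():
        # Union of the base pairs observed in all other identity sets.
        others = set().union(
            *(p for n, p in pairs_by_identity.items() if n != name)
        )
        only_here = pairs - others
        if only_here:
            unique[name] = only_here
    return unique


# Invented two-column toy alignment: column 0 pairs with column 1.
toy_sets = {
    "Trp": ["TA", "TA"],
    "Ser": ["GC", "AT"],
    "Gln": ["GC"],
}
result = unique_base_pairs(toy_sets, 0, 1)
```

A real data set contains occasional exceptions (for example a few sequences of another identity carrying the same pair), so a tolerance threshold would probably be needed in practice; the strict version above is kept for clarity.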

3.5.1. Escherichia coli tRNA Trp T31:A39 This base
pair is present in the tRNA Trp sequences of all but a
very few species that provided tRNA Trp sequences to
our filtered coli-like data set. The same base pair
occurs in only a single tRNA Ser in four coli-like species
and in only a single tRNA Gln in two species. We note that
only the anticodon loop, the discriminator base and
the acceptor stem were experimentally tested for
coli tRNA Trp determinants; 59-61 thus the T31:A39
pair might be a hitherto undetected identity element.

3.5.2. Yeast Met T31:T39 This unconventional
base pair is characteristic of eukaryotic elongator
tRNA Met. Both eukaryotic and coli initiator
tRNA Met and the coli elongator tRNA Met contain a
normal G31:C39 pair at these positions. In the coli
initiator tRNA, the normal G31:C39 base pair is
required for binding to the P site on the ribosome
and for protein synthesis initiation, while in the coli
elongator tRNA Met, the same pair is required for
proper acylation by the corresponding AARS enzyme.
Eukaryotic initiator tRNA Met from a wide range of sources were
shown to be good substrates of the coli Met-RS
enzyme, while the eukaryotic elongator tRNA Met
from the same range of sources were all poor substrates.
When the original T31:T39 pair in the eukaryotic
tRNA Met was replaced with a G31:C39 pair, it
became a good substrate for the coli enzyme, and
symmetrically, when the G31:C39 pair was replaced
with a T31:T39 pair in the coli elongator tRNA Met, it
became a good substrate for the eukaryotic enzyme.
Thus, the kingdom-specific base pair is a determinant
in both kingdoms. Moreover, it was also shown that in
the elongator tRNA, it is not the identity of the bases
that matters, but whether they form a
Watson-Crick base pair or not. If they do, the
corresponding tRNA Met is a good substrate of the bacterial
enzyme and a poor substrate of the eukaryotic
enzyme. If they do not, the corresponding tRNA Met is
a good substrate for the eukaryotic enzyme and a
poor substrate for the bacterial enzyme. Thus, a
Watson-Crick base pair at these positions or the
lack of it affects the structure and/or malleability of
the anticodon loop, which has an important role in
the proper interaction with the AARS enzyme.
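
The qualitative rule reported above for elongator tRNA Met can be written down as a small Python check. This is a minimal sketch for illustration only: the function names are hypothetical, the labels 'bacterial' and 'eukaryotic' merely restate the substrate preference described in the text, and the rule is not a quantitative model of aminoacylation. Both T (tDNA notation) and U are accepted as the partner of A.

```python
# Canonical Watson-Crick pairs; T and U are treated alike because the
# data set consists of tDNA sequences while the molecule itself is RNA.
WATSON_CRICK = {
    ("A", "T"), ("T", "A"),
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
}


def is_watson_crick(base_31, base_39):
    """True if the two bases form a canonical Watson-Crick pair."""
    return (base_31.upper(), base_39.upper()) in WATSON_CRICK


def preferred_metrs(base_31, base_39):
    """Restate the reported rule: a Watson-Crick 31:39 pair favours the
    bacterial enzyme, any other pair favours the eukaryotic enzyme."""
    return "bacterial" if is_watson_crick(base_31, base_39) else "eukaryotic"
```

For example, `preferred_metrs("G", "C")` gives `"bacterial"` and `preferred_metrs("T", "T")` gives `"eukaryotic"`, mirroring the two swap experiments described above.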

It has also been shown that replacing the G31:C39
base pair with a T31:T39 pair in the yeast initiator
tRNA enables it to participate in the elongation
phase.

Note that this base pair has not been defined as a 
determinant per se, as the studies compared either 
isospecific elongator tRNAs from two different king- 
doms or elongator vs. initiator tRNAs rather than 
two different elongator tRNA identities from the 
same species. 62-64 

3.5.3. Yeast Ile T30:G40 In the case of yeast
tRNA Ile, only the three anticodon positions were
studied as identity elements, 65 and the potential
role of this nearby non-Watson-Crick base pair has
not been tested. In our data set, this base pair is
present in the majority of eukaryotic species from
Arabidopsis through Drosophila to human for
several Ile isoacceptors, suggesting that this base pair
might have an identity-defining role.

3.5.4. Yeast Asp G30:T40 At the same positions
as for tRNA Ile, a different non-Watson-Crick base
pair is present in yeast tRNA Asp sequences. This base
pair is not conserved in eukaryotic species. In our
data set, it is present only in yeast and
Caenorhabditis elegans. As high AEVs suggest that the
G30:T40 pair might be an identity element in yeast
(and also in C. elegans), we checked the PDB for
other yeast tRNA/AARS complex structures.

Figure 4. Comparison of tRNA Asp/AspRS structures from E. coli and yeast. (A) Structure of the E. coli tRNA Asp/AspRS complex 68 ; PDB: 1IL2.
(B) Structure of the yeast tRNA Asp/AspRS complex 66 ; PDB: 1ASY. Colour coding: AARS, 'wheat'; tRNA, pale cyan; Lys58 (E. coli, panel A)
and Lys88 (yeast, panel B) and the G30:C40 (E. coli, panel A) and G30:U40 (yeast, panel B) base pairs are green. Other bases illustrated
as sticks are already published identity elements. See Section 3 for details. This figure can be viewed in colour online.



Unfortunately, besides the Asp complex, 66 only two other
structures are available. One is for tRNA Tyr (PDB:
2DLC), which is incomplete, and the other is for
tRNA Arg. 67 Both have a normal G30:C40 base pair.
Note that ArgRS and TyrRS belong to Class I, while
AspRS belongs to Class II. As enzymes from the two
classes approach the tRNA from opposite sides, we
did not expect any similarities in the enzyme-tRNA
interactions themselves.

As could be expected, the interactions are not
homologous. In the case of tRNA Asp, the ribose-phosphate
portion of G30 lies 2.9-3.3 Å (there are two
complexes in the unit cell) from the Lys88
side chain of the enzyme, which suggests a salt
bridge between the tRNA and the enzyme at this
position. The residue in ArgRS equivalent to AspRS Lys88
is Lys78, which does not interact with
the tRNA. On the other hand, in the tRNA Arg structure,
the phosphate of C40 appears to form an H-bond with
Ser440 of the enzyme, which lies 2.6 Å away. The
incomplete tRNA Tyr structure either does not have an
interaction with the enzyme at this position or it is
not resolved in the model. It is plausible that a G:T
pair is disrupted more readily than a G:C pair. It
appears that such local disruption of a G:T pair is
required for forming a stable salt bridge with the
enzyme in the case of tRNA Asp. On the other hand,
tRNA Arg has a different stabilizing interaction, which
does not necessitate such a local perturbation.

We also checked whether a different base pair exists 
at these positions in bacterial tRNA Asp and if so, 
whether it affects the recognition mode by their 
respective AARS enzymes. In the coli tRNA homo- 
logue, there is a traditional G30:C40 pair and fortu- 
nately there is a tRNA:AARS complex structure for 
bacterial tRNA Asp in the PDB, 68 so we could compare 
the two homologous interactions from the two 
different kingdoms. Differences are shown in Fig. 4. 



As already mentioned, in the yeast system, the
ribose-phosphate portion of G30 of the tRNA forms
a salt bridge with the Lys88 side chain of the
enzyme. A Needleman-Wunsch alignment of the
coli and yeast synthetase sequences identified Lys58
as the coli equivalent of Lys88. This side chain lies
10.3 Å from the G30 phosphate moiety;
therefore, no salt bridge is formed. This comparison
clearly shows that identical tRNA positions are recognized
differently by the two iso-functional AARS
enzymes.
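
To illustrate how a Needleman-Wunsch global alignment yields equivalent residue numbers in two sequences, a compact Python sketch follows. It is not the alignment set-up used in the study: the scoring scheme (match +1, mismatch -1, linear gap -1) is an assumption chosen for brevity, whereas protein alignments normally use a substitution matrix and affine gap penalties, and the two short strings are invented rather than taken from the synthetases. The helper `equivalent_residue` is likewise hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return the two gapped strings."""
    n, m = len(a), len(b)
    # Dynamic programming table of best scores for all prefix pairs.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def equivalent_residue(aligned_a, aligned_b, index_a):
    """Map a 1-based residue number of sequence a to the 1-based number of
    the residue aligned with it in b, or None if it faces a gap."""
    ia = ib = 0
    for ca, cb in zip(aligned_a, aligned_b):
        if ca != "-":
            ia += 1
        if cb != "-":
            ib += 1
        if ca != "-" and ia == index_a:
            return ib if cb != "-" else None
    return None


# Invented toy sequences; the second lacks the T found at position 3.
aligned_a, aligned_b = needleman_wunsch("MKTAYLK", "MKAYLK")
```

With these toy inputs, residue 7 of the first sequence maps to residue 6 of the second, while residue 3 faces a gap. The same index mapping, applied to a full-length alignment, is what relates a residue number in one enzyme to its counterpart in the other.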

Furthermore, the observed direct interaction of the
G30:T40 pair with the AARS enzyme suggests that it
might be an identity element in yeast.

3.5.5. Core region In E. coli, it has been shown
that the core region, formed by the 15:48 pair and
by [13:22]:46, has identity functions. The G15:G48
pair was shown to be a Cys identity element. 26 The core
region was shown to be important for tRNA Pro
identity 69 and for discriminating tRNA Glu from
tRNA Asp. 30 While in the bacterial set the 15:48 pair
has a medium-level (green zone) AEV, positions 22
and 46 have high (yellow zone) and position 13
very high (red zone) AEVs. The eukaryote set also
shows high AEVs at these positions (all in the yellow
zone), suggesting that the core region might have
similar identity roles in the eukaryotic kingdom as
well. 70

3.6. Conclusions 

Deciphering the identity elements of tRNAs has been
one of the most interesting problems of molecular
biology, with a long and successful history. 2-5 It
is unquestionable that only properly designed mutations
combined with in vitro and/or in vivo experiments
can identify such elements beyond any
doubt. However, the large number of even the
'reasonable' mutations and the astronomical number
of their combinations render these laborious studies
rather challenging. As a consequence, the existing
collection of already determined identity elements
cannot be considered complete, and several
bioinformatics studies have aimed to predict such
hitherto hidden elements. 6,13-17 Most of these studies
applied classical sequence analysis tools and
searched for conserved sequence features.

The ECP analysis, on the other hand, is based on 
strictly absent elements and provides a simple but 
meaningful measure of pairwise mathematical dis- 
tances of functionally related sequence sets. 18 
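
The idea of strictly absent elements can be made concrete with a short Python sketch. It is offered only as an illustration under stated assumptions: the function names and the dictionary layout are hypothetical, the toy sequences are invented, and the normalization of the average excluding value (the number of ordered set pairs discriminated at a position, divided by the number of sets) is an assumed reading of the definition, which is given in full in the Methods and in ref. 18 rather than here.

```python
def discriminating_elements(set_a, set_b, pos):
    """Nucleotides that occur at `pos` in set_b but are strictly absent
    at `pos` from every sequence of set_a."""
    present_a = {seq[pos] for seq in set_a}
    present_b = {seq[pos] for seq in set_b}
    return present_b - present_a


def average_excluding_values(identity_sets):
    """Per-position average, over all sets, of how many other sets each
    set is discriminated from (assumed normalization, see lead-in).

    identity_sets: dict mapping an identity name to a list of aligned
                   sequences of equal length (hypothetical data layout).
    """
    names = list(identity_sets)
    length = len(identity_sets[names[0]][0])
    aev = []
    for pos in range(length):
        excluded = 0
        # All ordered pairs (a, b); 20 identity sets would give 380 pairs.
        for a in names:
            for b in names:
                if a != b and discriminating_elements(
                        identity_sets[a], identity_sets[b], pos):
                    excluded += 1
        aev.append(excluded / len(names))
    return aev


# Invented toy data: three identity sets, two aligned positions.
toy_sets = {
    "X": ["AC", "AC"],
    "Y": ["GC", "GC"],
    "Z": ["AC", "GC"],
}
aev = average_excluding_values(toy_sets)
```

In the toy example, position 0 separates X from Y in both directions and also excludes Z-type sequences from X and from Y, whereas position 1 is invariant and therefore has no discriminating power.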

The reliability of bioinformatic studies is strongly 
related to the number of input sequences that are 
compared. Individual species have only a few tRNA 
sequences for each identity set. Comparing tRNA 
sequences from a pool of related species could 
improve the signal/noise ratio of the analysis, but it 
can be justified only if the comparison is functionally 
relevant, i.e. tRNA from one species would be properly 
charged by the corresponding AARS from the other 
species. 

We produced starting data sets in which bacterial
sequences, grouped into the 20 identity sets, were
filtered for the presence of major coli identity elements,
while eukaryotic sequences were filtered for the presence
of yeast identity elements. We argue that the common presence
of such identity elements in the starting data set
favours interspecies compatibility. If so, isofunctional
tRNAs from different species that share already published
identity elements should also share their yet
undiscovered determinants. This means that
such an analysis should identify elements that have
not yet been recognized and, at the same time,
those that had already been published but were not
included in the filtering set. If hitherto unknown elements
are detected, they should be considered potential
identity elements only in bacterial or eukaryotic
species that also use the coli or the yeast identity
element sets, respectively, that were applied for filtering.
Thus, this analysis could clearly miss elements
that are not conserved across species.
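
The filtering step summarized above amounts to keeping only those sequences that carry every required identity element. A minimal Python sketch is given below; the function names, the 0-based position indexing and the toy sequences are assumptions made for illustration, and the required elements shown are invented rather than the published coli or yeast determinants.

```python
def has_identity_elements(seq, required):
    """True if `seq` carries every required nucleotide.

    required: dict mapping a 0-based alignment position to the
              nucleotide that must be present there (hypothetical layout).
    """
    return all(pos < len(seq) and seq[pos] == nt
               for pos, nt in required.items())


def filter_identity_set(seqs, required):
    """Keep only the sequences that carry all required identity elements."""
    return [s for s in seqs if has_identity_elements(s, required)]


# Invented example: require G at column 0 and A at column 3.
toy_seqs = ["GCTA", "GCTC", "ACTA", "GGGA"]
kept = filter_identity_set(toy_seqs, {0: "G", 3: "A"})
```

Here only the first and last toy sequences pass the filter. Paired elements such as G1:C72 could be handled in the same way by listing both positions of the pair.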

We demonstrated that positional AEVs measure
sequence feature distances of functional groups and
correlate with the number of already identified
determinants (Table 1). Based on the above arguments, we
suggested that positions having high AEV but few or
no identified determinants predict locations of
hidden identity elements. We listed the most characteristic
such positions and, for one of these, suggested
a structural rationale for its being an identity element.

We also showed that omitting the second filtration
in the case of both Bacteria and Eukarya still preserves
the characteristic pattern of high AEV positions, which
overlaps with the positions of the most important
identity elements such as the anticodon bases and
the discriminator base.

In this study, we have demonstrated that the ECP 
algorithm is capable of assessing the level of discrim- 
inating power of positions in separating functionally 
different sequence sets. As genomic sequencing con- 
tinues at an increasing rate, our ECP analysis can be 
performed over and over on ever-increasing filtered 
databases. 

We believe that when enough input data are avail- 
able, we can analyse the discriminating power not 
only of individual positions but combinations of posi- 
tions as well. 

Nevertheless, only carefully designed mutations and
laboratory experiments can assess the predictive
potential of the ECP algorithm to identify hidden
determinants.